Before building predictive models and advanced macro-visualizations, it is crucial to understand the fundamental distributions and behavioral correlations within our customer base. This phase is divided into Exploratory Data Analysis (EDA) and Confirmatory Data Analysis (CDA).
1. Loading packages
Code
# Load all required libraries for advanced interactive analytics
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(survival)
library(networkD3)
library(scales)
library(broom)  # For tidying survival models

# Load the core application dataset
df <- readRDS("app_data.rds")

# Define the FinTech color palette
fintech_colors <- c("#1c2541", "#0077b6", "#00b4d8", "#90e0ef", "#caf0f8")
risk_color <- "#e74c3c"
2. Demographics (EDA)
Our EDA phase focuses on understanding who our customers are. Using ggplot2 and plotly in our application, stakeholders can dynamically filter these demographic distributions by gender and customer segment.
2.1 Income and Education Distributions
We visualized the age distribution across education levels to examine how age correlates with educational attainment.
Code
p_eda <- ggplot(df, aes(x = education_level, y = age, fill = education_level)) +
  geom_boxplot(alpha = 0.8, color = "#1c2541", outlier.color = risk_color) +
  scale_fill_manual(
    values = colorRampPalette(fintech_colors)(length(unique(df$education_level)))
  ) +
  theme_minimal() +
  labs(title = "Age Distribution by Education Level", x = "Education Level", y = "Age") +
  theme(legend.position = "none", panel.grid.minor = element_blank())

# Convert to an interactive Plotly object
ggplotly(p_eda) %>% config(displayModeBar = FALSE)
3. Behavioral Analysis (CDA)
The CDA phase seeks to test hypotheses regarding how customer behavior impacts overall business metrics like Customer Lifetime Value (CLV) and Net Promoter Score (NPS).
3.1 Interactive Correlation Analysis
We deployed interactive scatter plots that allow users to map specific X and Y variables against each other (e.g., Total Transactions vs. Satisfaction Score).
By overlaying a statistical trendline (using the lm method), stakeholders can immediately confirm or reject suspected correlations.
Code
ggplot(df, aes(x = tx_count, y = satisfaction_score)) +
  geom_point(alpha = 0.5, color = "#2c3e50") +
  geom_smooth(method = "lm", color = "#e74c3c", se = FALSE) +
  theme_minimal() +
  labs(title = "Total Transactions vs Satisfaction Score",
       x = "Total Transaction Count",
       y = "Satisfaction Score (1-5)")
Transaction vs. Satisfaction Hypothesis
We plotted transaction counts against satisfaction scores. The red trendline highlights a concerning friction point: higher transaction volumes do not strictly guarantee higher satisfaction.
Code
p_cda <- ggplot(df, aes(x = tx_count, y = satisfaction_score,
                        text = paste("Support Tickets:", support_tickets_count))) +
  geom_point(alpha = 0.4, color = "#0077b6") +
  geom_smooth(method = "lm", color = risk_color, se = TRUE, fill = "#ffcccb", alpha = 0.3) +
  theme_minimal() +
  labs(title = "Hypothesis Testing: Usage vs Satisfaction",
       x = "Total Transaction Count",
       y = "Satisfaction Score (1-5)")

ggplotly(p_cda, tooltip = c("x", "y", "text")) %>% config(displayModeBar = FALSE)
3.2 Transaction Volume Breakdown
We aggregated the total dollar volume flowing through four distinct transactional categories:
Payments
Transfers
Withdrawals
Deposits
This breakdown acts as the foundational baseline that later informs the proportional weights used in our Macro Cash Flow Sankey Diagram.
Code
# Summarise total dollar volume by preferred transaction type
tx_sums <- df %>%
  group_by(preferred_transaction_type) %>%
  summarise(Total_Volume = sum(total_tx_volume, na.rm = TRUE)) %>%
  arrange(desc(Total_Volume))  # Sort from highest to lowest volume

# Print the final data table to the webpage
tx_sums
4. Algorithmic Segmentation (K-Means Clustering)
To move beyond basic demographic segmentation, we applied a K-Means clustering algorithm (k = 3) to group users by their transaction frequency and monetary volume, uncovering behavioral segments hidden within our user base. After scaling tx_count (transaction frequency) and total_tx_volume (monetary value), the algorithm identified three distinct customer clusters without any manual labeling.
Code
# 1. Prepare data and apply K-Means on scaled features
set.seed(42)
cluster_data <- df %>%
  select(tx_count, total_tx_volume) %>%
  drop_na()
kmeans_result <- kmeans(scale(cluster_data), centers = 3)
cluster_data$Cluster <- paste("Persona", kmeans_result$cluster)

# 2. Interactive cluster plot
p_cluster <- ggplot(cluster_data, aes(x = tx_count, y = total_tx_volume, color = Cluster)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("#0077b6", "#00b4d8", risk_color)) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_minimal() +
  labs(title = "Algorithmic Behavioral Segmentation",
       x = "Transaction Frequency",
       y = "Total Monetary Volume")

ggplotly(p_cluster) %>%
  layout(legend = list(orientation = "h", x = 0.3, y = -0.2))
5. Transition to Advanced Analytics
With a solid understanding of our demographics and baseline transaction volumes established in this EDA/CDA phase, we confidently move forward to our predictive and advanced modeling phases. The insights gathered here directly feed into the variables chosen for our Linear Regression Risk Simulator and Kaplan-Meier Survival curves.
5.1 RFM Analysis
While K-Means provides algorithmic grouping, RFM (Recency, Frequency, Monetary) provides established business logic. We categorized our users based on their `R_Score`, `F_Score`, and `M_Score` to identify key groups like “Champions” and “At-Risk” users. In our Shiny application, this is visualized as an interactive Treemap.
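The report uses `R_Score`, `F_Score`, and `M_Score` without showing how they are derived. A minimal sketch of one common quintile-based scoring approach is below; the recency column `days_since_last_tx` is a hypothetical name (the actual recency field is not shown in this report), while tx_count and total_tx_volume are the frequency and monetary columns used elsewhere.

```r
library(dplyr)

# Quintile-based RFM scoring (1 = worst, 5 = best).
# NOTE: `days_since_last_tx` is an assumed recency column name.
df_rfm <- df %>%
  mutate(
    R_Score = ntile(-days_since_last_tx, 5),  # more recent activity -> higher score
    F_Score = ntile(tx_count, 5),
    M_Score = ntile(total_tx_volume, 5),
    RFM_Segment = case_when(
      R_Score >= 4 & F_Score >= 4 & M_Score >= 4 ~ "Champions",
      R_Score <= 2 & F_Score >= 3               ~ "At-Risk",
      TRUE                                      ~ "Regular"
    )
  )
```

The segment rules shown are illustrative thresholds, not the app's exact business logic.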
Code
#| fig-align: "center"

# Calculate the size of each RFM segment
rfm_summary <- df %>%
  group_by(RFM_Segment) %>%
  summarise(Customer_Count = n()) %>%
  arrange(desc(Customer_Count))

# Plot the segment distribution
ggplot(rfm_summary, aes(x = reorder(RFM_Segment, Customer_Count),
                        y = Customer_Count, fill = RFM_Segment)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Customer Distribution by RFM Segment",
       x = "RFM Segment",
       y = "Number of Customers")
Interactive RFM Treemap
RFM Analysis provides established business logic. Here is the entire distribution of our customer base grouped by their Recency, Frequency, and Monetary scores.
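The treemap itself is rendered in the Shiny app; a minimal sketch of how it could be built with plotly's treemap trace is below, assuming df carries an RFM_Segment column as in the bar chart above.

```r
library(dplyr)
library(plotly)

# Aggregate customer counts per RFM segment
rfm_counts <- df %>%
  count(RFM_Segment, name = "Customer_Count")

# One-level treemap: every segment hangs off a single invisible root ("")
plot_ly(
  type     = "treemap",
  labels   = rfm_counts$RFM_Segment,
  parents  = rep("", nrow(rfm_counts)),
  values   = rfm_counts$Customer_Count,
  textinfo = "label+value+percent parent"
) %>%
  layout(title = "Customer Base by RFM Segment") %>%
  config(displayModeBar = FALSE)
```

Tile area is proportional to Customer_Count, so dominant segments such as "Champions" or "At-Risk" stand out at a glance.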
5.2 Survival Analysis (Time-to-Churn)
To understand the macro-dynamics of our user base, we explored the dimensions of Time (when users leave) and Capital (where money flows), starting with Survival Analysis (Time-to-Churn).
We applied a Kaplan-Meier Survival Model to map the probability of a customer remaining active over time. By using customer_tenure and the binary churn_event indicator, we can pinpoint the exact month where retention efforts must be intensified.
Code
# Build the Kaplan-Meier survival object
km_fit <- survfit(Surv(customer_tenure, churn_event) ~ 1, data = df)

# Plot the baseline survival curve using base R (survminer is used in the app)
plot(km_fit,
     xlab = "Customer Tenure (Months)", ylab = "Survival Probability",
     main = "Kaplan-Meier Estimate of Customer Retention",
     col = "#0077b6", lwd = 2, conf.int = TRUE)
abline(v = 3, col = "red", lty = 2)  # Highlight the Month 3 risk zone
Code
# Build the Kaplan-Meier model and tidy it for ggplot
km_fit <- survfit(Surv(customer_tenure, churn_event) ~ 1, data = df)
km_data <- tidy(km_fit)  # broom returns time, estimate, conf.low, conf.high

# Interactive survival curve
p_surv <- ggplot(km_data, aes(x = time, y = estimate)) +
  geom_step(color = "#1c2541", size = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "#0077b6") +
  geom_vline(xintercept = 3, color = risk_color, linetype = "dashed") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_minimal() +
  labs(title = "Kaplan-Meier Retention Curve",
       x = "Customer Tenure (Months)",
       y = "Probability of Retaining User") +
  annotate("text", x = 4.5, y = 0.6, label = "High Risk Zone", color = risk_color)

ggplotly(p_surv) %>% config(displayModeBar = FALSE)
5.3 Macro Cash Flow (Sankey Network)
To visualize liquidity, we mapped the flow of transactions using our actual transactional dataset. Deposits represent capital entering the FinRetain ecosystem (“App Wallet”), while Payments, Transfers, and Withdrawals represent capital flowing out to various destinations.
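A minimal sketch of the Sankey structure using networkD3 (already loaded above) is shown below. The link weights here are illustrative placeholders; in the app they are the aggregated dollar volumes from the tx_sums breakdown in Section 3.2.

```r
library(networkD3)

# Nodes: capital enters via Deposits, pools in the App Wallet, and exits
# through Payments, Transfers, and Withdrawals
nodes <- data.frame(name = c("Deposits", "App Wallet",
                             "Payments", "Transfers", "Withdrawals"))

# source/target are 0-indexed row positions in `nodes`.
# The `value` weights below are placeholders for the real aggregates.
links <- data.frame(
  source = c(0, 1, 1, 1),
  target = c(1, 2, 3, 4),
  value  = c(100, 45, 30, 25)
)

sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name",
              fontSize = 12, nodeWidth = 30)
```

Because link widths are proportional to value, the diagram makes it immediately visible which outflow channel dominates relative to incoming deposits.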
5.4 Predictive Risk Modeling
The core of our Visual Analytics solution is the transition from reactive reporting to proactive forecasting. We built a baseline multiple linear regression model to predict churn_probability from user behavior metrics. This model powers the "Risk Simulator" in our Shiny app: stakeholders adjust slider inputs for these variables and the predicted risk is recalculated instantly.
Code
# Build a baseline predictive model for churn probability
predictive_model <- lm(churn_probability ~ support_tickets_count + tx_count + satisfaction_score,
                       data = df)

# Display the model summary to inspect coefficient estimates and significance
summary(predictive_model)
Call:
lm(formula = churn_probability ~ support_tickets_count + tx_count +
satisfaction_score, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.20780 -0.03801 0.01245 0.04759 0.16247
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.314e-01 2.192e-03 196.817 < 2e-16 ***
support_tickets_count -7.181e-05 2.988e-04 -0.240 0.81
tx_count -2.727e-06 4.881e-07 -5.587 2.32e-08 ***
satisfaction_score -2.119e-02 5.163e-04 -41.037 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06592 on 48719 degrees of freedom
Multiple R-squared: 0.03417, Adjusted R-squared: 0.03411
F-statistic: 574.6 on 3 and 48719 DF, p-value: < 2.2e-16
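The Risk Simulator re-scores risk by calling predict() on this fitted model with the current slider values. A minimal sketch follows; the input values are illustrative (in the Shiny app they come from the reactive slider inputs).

```r
# Recalculate predicted churn risk for one hypothetical slider configuration.
# These values are illustrative stand-ins for the app's slider inputs.
simulated_user <- data.frame(
  support_tickets_count = 2,
  tx_count              = 1500,
  satisfaction_score    = 3.5
)

predicted_risk <- predict(predictive_model, newdata = simulated_user)
scales::percent(predicted_risk, accuracy = 0.1)  # format as a percentage
```

Note the model summary above: satisfaction_score and tx_count are highly significant, but support_tickets_count is not (p = 0.81), so the simulator's risk estimate is driven almost entirely by the first two sliders.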